ESSOP: Efficient and Scalable Stochastic Outer Product Architecture for Deep Learning
Deep neural networks (DNNs) have surpassed human-level accuracy in a variety
of cognitive tasks but at the cost of significant memory/time requirements in
DNN training. This limits their deployment in energy- and memory-limited
applications that require real-time learning. Matrix-vector multiplications
(MVM) and vector-vector outer product (VVOP) are the two most expensive
operations associated with the training of DNNs. Strategies to improve the
efficiency of MVM computation in hardware have been demonstrated with minimal
impact on training accuracy. However, the VVOP computation remains a relatively
under-explored bottleneck even with the aforementioned strategies. Stochastic
computing (SC) has been proposed to improve the efficiency of VVOP computation
but on relatively shallow networks with bounded activation functions and
floating-point (FP) scaling of activation gradients. In this paper, we propose
ESSOP, an efficient and scalable stochastic outer product architecture based on
the SC paradigm. We introduce efficient techniques to generalize SC for weight
update computation in DNNs with unbounded activation functions (e.g., ReLU),
as required by many state-of-the-art networks. Our architecture reduces the
computational cost by re-using random numbers and replacing certain FP
multiplication operations with bit-shift scaling. We show that the ResNet-32
network with 33 convolution layers and a fully-connected layer can be trained
with ESSOP on the CIFAR-10 dataset to achieve baseline-comparable accuracy.
Hardware design of ESSOP at 14nm technology node shows that, compared to a
highly pipelined FP16 multiplier design, ESSOP is 82.2% and 93.7% better in
energy and area efficiency, respectively, for outer product computation.
Comment: 5 pages, 5 figures. Accepted at ISCAS 2020 for publication.
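The abstract does not give implementation details, but the core idea of a stochastic-computing outer product with random-number reuse and power-of-two (bit-shift) scaling can be sketched in a few lines. The Python snippet below is a minimal illustration under those assumptions; the function name, the number of passes, and the scaling scheme are illustrative choices, not the ESSOP design.

```python
import numpy as np

def sc_outer_product(x, d, n_passes=8, rng=None):
    """Minimal sketch of a stochastic-computing (SC) outer product of x and d.

    Magnitudes are scaled into [0, 1] by a power-of-two factor (a stand-in
    for bit-shift scaling), encoded as Bernoulli bit-streams, and multiplied
    by AND-ing bits. One random number per row and one per column is drawn
    each pass and reused across the whole outer product.
    """
    rng = np.random.default_rng() if rng is None else rng
    sign = np.outer(np.sign(x), np.sign(d))
    # power-of-two scaling so that magnitudes fall into [0, 1]
    kx = int(np.ceil(np.log2(max(np.abs(x).max(), 1e-12))))
    kd = int(np.ceil(np.log2(max(np.abs(d).max(), 1e-12))))
    px, pd = np.abs(x) / 2.0**kx, np.abs(d) / 2.0**kd
    acc = np.zeros((x.size, d.size))
    for _ in range(n_passes):
        bx = (px > rng.random(x.size)).astype(float)  # one random number per row
        bd = (pd > rng.random(d.size)).astype(float)  # one random number per column
        acc += np.outer(bx, bd)                       # AND of {0,1} bits == product
    # undo the bit-shift scaling and restore signs
    return sign * acc / n_passes * 2.0**(kx + kd)

# Toy usage: the estimate converges to np.outer(x, d) as n_passes grows.
x = np.array([0.9, -1.7, 0.2])
d = np.array([0.5, 2.3])
print(sc_outer_product(x, d, n_passes=256))
```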
Accurate deep neural network inference using computational phase-change memory
In-memory computing is a promising non-von Neumann approach for making
energy-efficient deep learning inference hardware. Crossbar arrays of resistive
memory devices can be used to encode the network weights and perform efficient
analog matrix-vector multiplications without intermediate movements of data.
However, due to device variability and noise, the network needs to be trained
in a specific way so that transferring the digitally trained weights to the
analog resistive memory devices will not result in significant loss of
accuracy. Here, we introduce a methodology to train ResNet-type convolutional
neural networks that results in no appreciable accuracy loss when transferring
weights to in-memory computing hardware based on phase-change memory (PCM). We
also propose a compensation technique that exploits the batch normalization
parameters to improve the accuracy retention over time. We achieve a
classification accuracy of 93.7% on the CIFAR-10 dataset and a top-1 accuracy
on the ImageNet benchmark of 71.6% after mapping the trained weights to PCM.
Our hardware results on CIFAR-10 with ResNet-32 demonstrate an accuracy above
93.5% retained over a one day period, where each of the 361,722 synaptic
weights of the network is programmed on just two PCM devices organized in a
differential configuration.
Comment: This is a pre-print of an article accepted for publication in Nature
Communications.
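The training methodology itself is not spelled out in the abstract; a common way to make a network robust to analog-device variability is to perturb the weights with noise during the forward pass of training. The NumPy sketch below is only a hedged stand-in for that idea; the noise model and its magnitude are illustrative assumptions, not the procedure used in the paper.

```python
import numpy as np

def noisy_linear_forward(x, W, b, noise_frac=0.04, rng=None):
    """Hedged sketch of noise-injection training for analog in-memory inference.

    Each forward pass perturbs the weights with additive Gaussian noise whose
    standard deviation is a fraction of the largest weight magnitude, standing
    in for PCM programming/read variability. The backward pass would use the
    unperturbed weights; noise_frac = 0.04 is an illustrative value only.
    """
    rng = np.random.default_rng() if rng is None else rng
    w_max = np.max(np.abs(W))
    W_noisy = W + rng.normal(0.0, noise_frac * w_max, size=W.shape)
    return x @ W_noisy + b
```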
Mixed-precision deep learning based on computational memory
Deep neural networks (DNNs) have revolutionized the field of artificial
intelligence and have achieved unprecedented success in cognitive tasks such as
image and speech recognition. Training of large DNNs, however, is
computationally intensive and this has motivated the search for novel computing
architectures targeting this application. A computational memory unit with
nanoscale resistive memory devices organized in crossbar arrays could store the
synaptic weights in their conductance states and perform the expensive weighted
summations in place in a non-von Neumann manner. However, updating the
conductance states in a reliable manner during the weight update process is a
fundamental challenge that limits the training accuracy of such an
implementation. Here, we propose a mixed-precision architecture that combines a
computational memory unit performing the weighted summations and imprecise
conductance updates with a digital processing unit that accumulates the weight
updates in high precision. A combined hardware/software training experiment of
a multilayer perceptron based on the proposed architecture using a phase-change
memory (PCM) array achieves 97.73% test accuracy on the task of classifying
handwritten digits (based on the MNIST dataset), within 0.6% of the software
baseline. The architecture is further evaluated using accurate behavioral
models of PCM on a wide class of networks, namely convolutional neural
networks, long short-term memory networks, and generative adversarial networks.
Accuracies comparable to those of floating-point implementations are achieved
without being constrained by the non-idealities associated with the PCM
devices. A system-level study demonstrates a 173x improvement in the energy
efficiency of the architecture when it is used to train a multilayer perceptron,
compared with a dedicated fully digital 32-bit implementation.
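The key mechanism, accumulating weight updates digitally in high precision and transferring them to the analog devices only in coarse increments, can be sketched as follows. This is a minimal illustration assuming a per-weight accumulator chi and a device update granularity eps; the names, the pulse model, and the 20% variability figure are assumptions made here for illustration, not values from the paper.

```python
import numpy as np

def mixed_precision_step(chi, desired_update, w_device, eps, pulse_fn):
    """Hedged sketch of a mixed-precision weight update.

    chi            : high-precision update accumulator (digital unit)
    desired_update : e.g. -learning_rate * gradient for this step
    w_device       : weights as realized by the analog memory devices
    eps            : smallest weight change a single programming pulse induces
    pulse_fn       : models the imprecise analog conductance update
    """
    chi = chi + desired_update
    n_pulses = np.trunc(chi / eps)            # whole pulses to issue per weight
    w_device = pulse_fn(w_device, n_pulses)   # imprecise in-memory update
    chi = chi - n_pulses * eps                # keep the residual digitally
    return chi, w_device

def noisy_pulse_model(w, n_pulses, eps=0.01, rel_sigma=0.2,
                      rng=np.random.default_rng(0)):
    """Illustrative device model: each pulse moves a weight by roughly eps."""
    return w + n_pulses * eps * (1.0 + rel_sigma * rng.standard_normal(w.shape))
```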
Circuit knitting with classical communication
The scarcity of qubits is a major obstacle to the practical usage of quantum
computers in the near future. To circumvent this problem, various circuit
knitting techniques have been developed to partition large quantum circuits
into subcircuits that fit on smaller devices, at the cost of a simulation
overhead. In this work, we study a particular method of circuit knitting based
on quasiprobability simulation of nonlocal gates with operations that act
locally on the subcircuits. We investigate whether classical communication
between these local quantum computers can help. We provide a positive answer by
showing that for circuits containing n nonlocal CNOT gates connecting two
circuit parts, the simulation overhead can be reduced from O(9^n) to O(4^n)
if one allows for classical information exchange. Similar improvements can be
obtained for general Clifford gates and, at least in a restricted form, for
other gates such as controlled rotation gates.
Comment: v2: 20 pages, 6 figures; minor typos fixed.
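To put that scaling in perspective, the short snippet below compares the two overheads for a few gate counts, assuming the O(9^n)-to-O(4^n) reduction stated above; constants and sub-leading factors are ignored, so the numbers are purely illustrative.

```python
# Back-of-the-envelope comparison of circuit-knitting sampling overheads for
# n nonlocal CNOT gates, using the leading-order O(9^n) vs O(4^n) scaling
# stated above (constants and sub-leading factors are ignored).
for n in (2, 5, 10):
    print(f"n = {n:2d}:  local operations only ~ {9**n:>12,d}   "
          f"with classical communication ~ {4**n:>10,d}")
```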
Quantum message-passing algorithm for optimal and efficient decoding
Recently, Renes proposed a quantum algorithm called belief propagation with quantum messages (BPQM) for decoding classical data encoded using a binary linear code with a tree Tanner graph that is transmitted over a pure-state CQ channel [1], i.e., a channel with classical input and pure-state quantum output. The algorithm presents a genuine quantum counterpart to decoding based on the classical belief propagation algorithm, which has found wide success in classical coding theory when used in conjunction with LDPC or Turbo codes. More recently, Rengaswamy et al. [2] observed that BPQM implements the optimal decoder on a small example code, in that it implements the optimal measurement that distinguishes the quantum output states for the set of input codewords with the highest achievable probability. Here we significantly expand the understanding, formalism, and applicability of the BPQM algorithm with the following contributions. First, we prove analytically that BPQM realizes optimal decoding for any binary linear code with a tree Tanner graph. We also provide the first formal description of the BPQM algorithm in full detail and without any ambiguity. In so doing, we identify a key flaw overlooked in the original algorithm and subsequent works, which implies that quantum circuit realizations will be exponentially large in the code dimension. Although BPQM passes quantum messages, other information required by the algorithm is processed globally. We remedy this problem by formulating a truly message-passing algorithm which approximates BPQM and has quantum circuit complexity O(poly(n), polylog(1/ϵ)), where n is the code length and ϵ is the approximation error. Finally, we also propose a novel method for extending BPQM to factor graphs containing cycles by making use of approximate cloning. We show some promising numerical results that indicate that BPQM on factor graphs with cycles can significantly outperform the best possible classical decoder.
ISSN: 2521-327X
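As background for the pure-state CQ channel mentioned above, the single-bit case can be written out explicitly. The notation below (overlap parametrized by an angle θ ∈ [0, π/2]) is assumed here for illustration and is not taken verbatim from the paper.

```latex
% Pure-state CQ channel: classical bit x in {0,1} is mapped to a pure state,
%   x -> |theta_x>,  with overlap  <theta_0|theta_1> = cos(theta).
% Optimal (Helstrom) single-copy discrimination with equal priors succeeds with
%   P_succ = (1/2) * (1 + sqrt(1 - |<theta_0|theta_1>|^2)) = (1/2)(1 + sin(theta)).
\[
  x \;\mapsto\; \lvert\theta_x\rangle, \qquad
  \langle\theta_0\vert\theta_1\rangle = \cos\theta, \qquad
  P_{\mathrm{succ}} = \tfrac{1}{2}\bigl(1 + \sin\theta\bigr).
\]
```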
Quasiprobability decompositions with reduced sampling overhead
Quantum error-mitigation techniques can reduce noise on current quantum hardware without the need for fault-tolerant quantum error correction. For instance, the quasiprobability method simulates a noise-free quantum computer using a noisy one, with the caveat of only producing the correct expected values of observables. The cost of this error-mitigation technique manifests as a sampling overhead that scales exponentially in the number of corrected gates. In this work, we present an algorithm based on mathematical optimization that aims to choose the quasiprobability decomposition in a noise-aware manner. This directly leads to a significantly lower base for the exponential sampling overhead compared to existing approaches. A key element of the novel algorithm is a robust quasiprobability method that allows for a tradeoff between an approximation error and the sampling overhead via semidefinite programming.
ISSN: 2056-6387
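The sampling-overhead mechanism the abstract refers to can be illustrated with a short Monte Carlo sketch: terms of a quasiprobability decomposition are sampled proportionally to |q_i| and reweighted by gamma * sign(q_i), where gamma = sum_i |q_i| sets the variance blow-up. The function name and the toy numbers below are assumptions for illustration only.

```python
import numpy as np

def quasiprob_estimate(quasi_coeffs, term_expvals, n_shots, rng=None):
    """Hedged sketch of quasiprobability sampling.

    An ideal operation is written as sum_i q_i * (implementable operation i),
    where the q_i may be negative. Sampling term i with probability
    |q_i| / gamma, gamma = sum_i |q_i|, and reweighting each outcome by
    gamma * sign(q_i) yields an unbiased estimate of the ideal expectation
    value. The shot count needed for a fixed precision grows with gamma^2,
    which is the sampling overhead the optimization above tries to reduce.
    """
    rng = np.random.default_rng() if rng is None else rng
    q = np.asarray(quasi_coeffs, dtype=float)
    gamma = np.abs(q).sum()
    idx = rng.choice(len(q), size=n_shots, p=np.abs(q) / gamma)
    # term_expvals[i] stands for the expectation value measured when
    # decomposition term i is executed on the noisy device
    samples = gamma * np.sign(q[idx]) * np.asarray(term_expvals, dtype=float)[idx]
    return samples.mean(), gamma

# Toy usage: two terms with coefficients 1.2 and -0.2, so gamma = 1.4 and the
# exact value is 1.2 * 0.8 - 0.2 * 0.3 = 0.90.
est, gamma = quasiprob_estimate([1.2, -0.2], [0.8, 0.3], n_shots=100_000)
```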
Délocalisations : de quoi parle-t-on ? De la quantification des opérations à la qualification des processus
Relocations are usually interpreted as part of economic globalisation. Recent qualitative trends give a new picture of relocations and call into question our usual understanding of the phenomenon. "Relocations: what are we talking about?" returns to the definition and quantification of relocations and proposes to interpret these new trends through firms' mobility and the rationality of their location choices. We conclude by discussing some analytical ways of understanding and explaining what is at stake in the relocation process at the firm and industry levels.